Bayesian sampling of integration dates in the latent HIV reservoir

Roux-Cil Ferreira, Emmanuel Wong and Art Poon

Department of Pathology and Laboratory Medicine, Western University, Canada

Dating HIV integration events

  • Replication-competent provirus from the latent reservoir can be sequenced from viral outgrowth experiments.
  • One approach is to build a phylogeny and map outgrowth sequences to dated plasma HIV RNA.
  • Requires extensive sampling of HIV-1 plasma RNA pre-ART - uncertainty in the tree and location of root.
Image source: Abrahams et al. (2019) The replication-competent HIV-1 latent reservoir is primarily established near the time of therapy initiation. Science Trans Med 11(513).

Root-to-tip regression

  • Under a strict molecular clock, the expected number of mutations increases linearly with time.
  • A root-to-tip (RTT) method regresses divergence from the root (\(y\)-axis) on sample collection dates (\(x\)-axis).
    • Fit regression to HIV RNA only, use model to predict integration dates.
    • May be more robust to uncertainty in tree (dating is a function of lineage-specific divergence).
Image source: Buonagurio DA, et al.. Evolution of human influenza A viruses over 50 years: rapid, uniform rate of change in NS gene. Science. 1986 May 23;232(4753):980-2.

Problems with RTT dating

  • Ignores variation in number of mutations — there is one, and only one, estimated date.
    • Unclear how to deal with divergent outgrowth sequences.
    • Proviral sequences can be mapped to post-ART period or the future (right, red zone).
  • Assumes location of root and clock rate are known without error.

A simple Bayesian model

  • The RTT regression model has three parameters:

    1. The date of the origin (\(x\)-intercept);
    2. The molecular clock (rate/slope, \(\mu\));
    3. The location of the root in the tree (determines divergence, \(y\))
  • We will sample these parameters from the posterior distribution.

    • Continue to assume the unrooted tree is known without error (constrains positioning of root).

Poisson likelihood

  • A strict clock assumes that mutations accumulate at a constant rate \(\mu\) over time, so variation is equal to the mean (Poisson).
    • Let \(Y_i\) be the number of mutations in the \(i^{\textrm{th}}\) sequence, given location of root.
    • Let \(\Delta t_i\) be the time elapsed between the \(i^{\textrm{th}}\) sample and the root.
  • The log-likelihood for RNA set \(\{Y_i, \Delta t_i\}\) is:

\[\log L(Y_i, \Delta t_i) = \sum_i Y_i\log(\mu \Delta t_i) - \mu \Delta t_i - \log \Gamma(Y_i+1)\]

  • This is the Langley-Fitch (1974) model.

Metropolis-Hastings sampling

Parameter Prior Proposal
Root Uniform \(\textrm{Unif}(-\delta, +\delta)\), reflection on tips and random choice of branches at splits.
Clock rate Lognormal \(\textrm{Unif}(-\delta, +\delta)\) proposal reflecting on zero.
Origin date Uniform Truncated normal proposal with mean 0 and variance \(\sigma\).

Sample run

Step 180 Step 19980

Sampling integration dates

  • Using Bayes’ rule, the probability of integration time \(t_i\) for outgrowth sequence with divergence \(Y_i\) is: \[P(t_i|Y_i) = \frac{P(Y_i|t_i) P(t_i)}{P(Y_i)}\]

  • We assume \(P(t_i) = \frac{1}{T-t_0}\), where \(t_0\) is the origin date and \(T\) is the maximum integration date, and letting \(s=t-t_0\):

\[ P(Y_i) = \frac{\int_0^{T-t_0} (\mu s)^{Y_i} \exp(-\mu s) \mathrm{d}s}{(T-t_0)\Gamma(Y_i+1)} = \frac{\gamma(Y_i+1, \mu(T-t_0))}{\mu(T-t_0)\Gamma(Y_i+1)} \]

where \(\gamma(s, x)\) is the lower incomplete gamma function.

Sampling integration dates (2)

  • Letting \(\Lambda=\mu(T-t_0)\), the probability of \(t_i\) given \(y_i\) mutations simplifies to: \[ P(t_i | Y_i) = \frac{\mu \Lambda^{y_i}\exp(-\Lambda)}{\gamma(Y_i+1, \Lambda)} \]

  • Since we can’t solve for the inverse CDF, we use a simple rejection method to sample integration times.

  • \(t_0\), \(y_i\) and \(\mu\) are sampled from the posterior distribution.

Simulation model

  • We used an exact stochastic simulation module in R (twt) to simulate cell population dynamics (forward time).
  • Discrete events with exponential waiting times determine population dynamics over time.

Simulating population dynamics

  • Branching events require a “source” cell to induce a “target” cell to change state.
    • Transmission of virus from an infected cell to a susceptible cell.
    • Virus replication completely blocked by ART initiation.
  • Transition events occur when a cell switches from one state (compartment) to another.
    • e.g., reactivation of a latently-infected cell.

Simulation scenarios

  1. Early sampling:
    • Sampled 10 HIV-1 RNA lineages at times \(t=3\), 6 and 9 post-infection.
    • Initiated ART at time \(t=10\).
    • Sampled 10 HIV-1 DNA lineages at time \(t=20\).
  2. Later sampling:
    • Sampled 10 HIV-1 RNA lineages at times \(t=11\), 13 and 15 post-infection.
    • Initated ART at time \(t=15\).
    • Sampled 10 HIV-1 DNA lineages at time \(t=20\).

Simulating trees

  • From the population size trajectories, twt samples trees in reverse time for given sampling times and cell types.
    • ▲ = branching event (infection from active cell, or clonal expansion of latent cell)
    • ○ = state transition (from active to latent or vice versa)
  • Retaining these tree annotations lets us collapse branches associated with latently-infected cells.

Simulating evolution

  • Ran trees from twt through INDELible
    • TN93 nucleotide substitution model with 40% As.
    • Seeded with an HIV-1 subtype C sequence at the root (AY772699)
  • Reconstructed trees from simulated alignments with FastTree2 (double precision build).
  • Input trees to bayroot, ran chain samples for \(2\times 10^4\) steps
    • Discarded burn-in of 2,000 steps
    • Thinned remaining chain down to 1,000 samples.

Comparing bayroot to root-to-tip regression

  • For comparison, ran RTT on same input trees by censoring sampling times associated with DNA tips
    • Rooted tree using rtt function in R package ape
    • Extracted root-to-tip distances from resulting tree
    • Fit simple linear regression of these distances against sampling times
  • Calculated root mean square error between estimated (\(\hat{t}\)) and actual (\(t\)) integration dates:
\(\textrm{RMSE}=\sqrt{\left. \sum_{i=1}^n (\hat{t}_i - t_i)^2 \right/ n}\)

Early sampling

When clock signal is strong, advantage of bayroot is largely due to prior information about ART initiation (dashed line). Integration date estimates for one replicate simulation. Paired Wilcoxon sign-rank test, \(P=3.55\times 10^{-4}\), \(n=50\).

Later sampling

  • Advantage of bayroot driven by prior information about the root.
    • With a less informative prior, bayroot results become similar to RTT.
Paired Wilcoxon sign-rank test, \(P=3.82\times 10^{-7}\), \(n=50\).

Application to real data

ZM1044M, Zambia-Emory HIV Research Project

Brooks et al., 2020, PLOS Pathog 16(6): e1008378.

Limitations

  • We are assuming the input unrooted phylogeny is known without error.
    • This assumption could be relaxed by hierarchical sampling (input a posterior sample of trees)
  • We assume the divergence of each sequence is an independent outcome.
    • Sequences share common ancestors, will have a similar root-to-tip distance because they inherit the same mutations.
    • This could be addressed by adapting the covariance matrix of the regression to the phylogenetic structure of the data.
  • A simple and efficient means of incorporating prior knowledge and managing uncertainty in dating the HIV reservoir.
  • R package available from https://github.com/PoonLab/bayroot
This study was supported by funding from CIHR and NIH (REACH AI164565-01). RC was supported by an Ontario Genomics-CANSSI Fellowship in Genome Data Science.